We apply a Bayesian "razor" to forecast Bayes factors between different parameterizations of the galaxy cluster mass 
function. To demonstrate this approach, we calculate the minimum size N-body simulation needed for strong evidence 
favoring a two-parameter mass function over one-parameter mass functions and vice versa, as a function of the minimum 
cluster mass. 
1 Introduction 

The galaxy cluster mass function describes the abundance 
of virialized astronomical objects as a function of their 
mass. The mass function is exponentially sensitive to the 
initial conditions, composition, and evolution history of the 
Universe. Thus, it can provide a powerful observational 
tool to look for features in the standard ΛCDM cosmological 
model, including evolution of the dark energy equation of 
state (e.g. Bhattacharya et al. 2010; Voit 2005 and 
references therein) and primordial non-Gaussianity (Des- 
jacques and Seljak 2010 and references therein). Numer- 
ous different parameterizations of the mass function have 
been proposed, with different numbers of free parameters 
(see Sect. 2). Current and future experiments will observe 
thousands of galaxy clusters (Geisbüsch and Hobson 2007; 
Menanteau et al. 2010; Rozo et al. 2010; Sehgal et al. 2007; 
Staniszewski et al. 2009), producing galaxy cluster cata- 
logs which can be used to constrain the shape of the clus- 
ter mass function. Yet will this data be sufficient to distin- 
guish statistically significant differences between parame- 
terizations? The minimum number of parameters required 
by data and/or simulations affects both data analysis (mod- 
els with fewer parameters are easier to work with) and the- 
ory (understanding the physical interpretation of additional 
parameters as well as their implications for cosmology). 
Uncertainties in the mass function parameterization can sig- 
nificantly affect cosmological constraints from cluster abun- 
dances (Cunha and Evrard 2010; Wu, Zentner, and Wechsler 
2010). 

Bayesian model selection is well-suited to address the 
question of how many mass function parameters are re- 
quired by data. Occam's razor implies that if two theories 
describe data equally well, the simpler explanation is prefer- 
able. Thus, to describe a relationship between two physical 
quantities, one would like to use a function with the fewest 
parameters necessary. Since any physical data (or 
N-body simulation) has stochastic error, adding additional 
parameters to the function will, in general, decrease the er- 
ror between the function and the data. This does not mean 
the data support adoption of the extra parameters; in the ex- 
treme case, if the function's number of degrees of freedom 
equals that of the data, the error will be zero. Thus one must 
be careful in comparing models with different numbers of 
parameters. Bayesian evidence and Bayes factors provide a 
rigorous way to quantify the trade-off between fewer param- 
eters and smaller error (Trotta 2008 and references therein), 
penalizing models with extra parameters unless they fit the 
data significantly better. It is useful to forecast Bayesian evi- 
dence in advance of data (e.g. Heavens, Kitching, and Verde 
2007; Mukherjee et al. 2006; Trotta 2007) to determine how 
well experiments will be able to rule out models. 



In this work, we show how to use a "razor" based on the 
Kullback-Leibler distance (Balasubramanian 1996, 1997) 
to forecast the Bayes factors between different models of 
the galaxy cluster mass function. Rather than forecasting 
the evidence for particular surveys, we raise the more gen- 
eral question of how much data would be required to dis- 
tinguish among different models of the cluster mass func- 
tion. A mass function model may be written as a probabil- 
ity distribution function with both a continuous part above 
the minimum discernible cluster mass, and a discrete part 
(or a constant probability density) below the cluster mass 
limit. As an illustrative example, we examine only the sim- 
plest, most ideal case, ignoring sample variance, evolution 
effects, and measurement errors, such as in a large N-body 
simulation with well-defined cluster masses at constant red- 
shift. We demonstrate how the ability to distinguish mod- 
els depends on the cluster mass limit. This approach may 
be extended to estimate the minimum size cluster surveys 
required to justify or discount additional parameters in the 
cluster mass function. 
In Sect. 2 we give an overview of cluster mass function 
models, and in Sect. 3 we define the razor and demonstrate 
its application to models with mixed probability distribution 
functions. We then apply the razor in Sect. 4 to distinguish 
between a two-parameter mass function and one-parameter 
mass functions. We conclude in Sect. 5. 

2 Cluster mass function 

The mass function n(m, z)dm is the comoving number den- 
sity of galaxy clusters with mass between m and m + dm 
at redshift z. As in Sheth and Tormen (1999), we write 
the mass function in terms of the dimensionless parameter 
$\nu = \delta_c^2 / \sigma^2(m, z)$, where $\delta_c$ is the critical overdensity for 
collapse, $\sigma^2(m, z)$ is the variance of density fluctuations 
smoothed on scales $r = (3m / 4\pi\bar\rho)^{1/3}$, and $\bar\rho$ is the mean 
matter density of the universe. The mass function is then 
given by the dimensionless function $f(\nu)$, where 

$$ f(\nu)\,d\nu = \frac{m}{\bar\rho}\, n(m, z)\,dm. \qquad (1) $$

The mass function thus also gives the probability density 
$p(\nu)$ that a particle with mass $\delta m \ll m$ is in a cluster with 
a mass parameter between $\nu$ and $\nu + d\nu$ (Manera, Sheth and 
Scoccimarro 2010): $p(\nu)\,d\nu = f(\nu)\,d\nu$. 

Since the formation of galaxy clusters is a highly non- 
linear problem, the exact shape of the cluster mass function 
in ΛCDM and alternative cosmologies remains an open 
question of great interest and active research (Bhattacharya 
et al. 2010). Analytical models for the cluster mass func- 
tion have been developed using the excursion set approach 
(Bond et al. 1991; Press and Schechter 1974; Sheth, Mo, 
and Tormen 2001; Zentner 2007 and references therein), 
which differ from fitting functions based on numerical sim- 
ulations presented by, e.g., Bhattacharya et al. 2010; Crocce 
et al. 2010; Evrard et al. 2002; Jenkins et al. 2001; Reed 
et al. 2007; Tinker et al. 2008; Warren et al. 2006. The 
recent fitting functions introduce extra parameters which 
do not currently have an underlying physical interpretation 
(Robertson et al. 2009). We demonstrate a new application 
of Bayesian model selection which can quantify the statis- 
tical significance of the differences among different func- 
tional forms of the cluster mass function. 

3 Bayesian razor for a mixed probability 
distribution 

Within a Bayesian statistical framework, the Bayesian evi- 
dence of different models may be used to compare their rel- 
ative statistical significance given certain data (Trotta 2008). 
For a model $M$ with $n$ parameters $\theta = \{\theta_1, \theta_2, \ldots, \theta_n\}$ and 
a data set of $N$ outcomes $\nu = \{\nu_1, \ldots, \nu_N\}$ drawn independently 
from a fiducial underlying probability density 
function $p_{\rm true}(\nu)$, the evidence is 

$$ p(\nu | M) = \int d^n\theta \; \Pi(\theta | M)\, p(\nu | \theta, M), \qquad (2) $$

where $\Pi(\theta | M)$ is a normalized prior distribution and 
$p(\nu | \theta, M)$ is the likelihood. 

The expectation of the log likelihood, $\langle \ln p(\nu | \theta, M) \rangle$, 
is equal to $-N$ times the Kullback-Leibler distance between 
the true distribution and the model distribution, 
$D(p_{\rm true}(\nu) \,\|\, p(\nu | \theta, M))$, plus a constant that depends 
only on the entropy of the true distribution. Following 
Balasubramanian (1996, 1997), we define the razor of a 
model, $R(M)$: 

$$ R(M) = \int d^n\theta \; \Pi(\theta | M)\, e^{-N D(p_{\rm true}(\nu) \| p(\nu | \theta, M))}. \qquad (3) $$

The ratio of the razors of two models thus may be used to 
forecast the Bayes factor (the ratio of the evidences) for a 
given fiducial model. 

Suppose we can only distinguish data with values of $\nu$ 
above a certain limit $\nu_d$. (In the context of the mass function, 
$\nu_d$ will correspond to the minimum cluster mass, known as 
the dust limit.) Define $f_d$ to be the fraction of the outcomes 
with $\nu < \nu_d$, that is 

$$ f_d = \int_0^{\nu_d} d\nu \, p(\nu). \qquad (4) $$

This situation results in a mixed probability distribution 
function, with a discrete "bin" for the fraction of data with 
$0 < \nu < \nu_d$ (with a constant probability density $f_d / \nu_d$) and 
a continuous probability distribution for data with $\nu > \nu_d$. 
The Kullback-Leibler (KL) distance is in this case: 

$$ D(\theta, M) = D(p_{\rm true}(\nu) \,\|\, p(\nu | \theta, M)) $$
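
For a mixed distribution of this kind, the KL distance splits into a discrete-bin term and an integral over the continuous part. The constant densities $f_d / \nu_d$ enter only through their ratio, so the bin contributes $f_d^{\rm true} \ln(f_d^{\rm true} / f_d^{\rm model})$. The Python sketch below evaluates that sum numerically. The function name `mixed_kl`, its arguments, the midpoint rule, and the exponential toy densities are illustrative assumptions and are not taken from the analysis itself.

```python
import math


def mixed_kl(pdf_true, pdf_model, fd_true, fd_model, nu_d, nu_max, steps=20000):
    """KL distance for a distribution binned below nu_d and continuous above it.

    fd_true and fd_model are the probabilities of falling below nu_d.
    The integral above nu_d uses a midpoint rule truncated at nu_max
    (an assumption: the tail beyond nu_max must be negligible).
    """
    # Discrete bin: both densities are constant (fd / nu_d), so nu_d cancels.
    bin_term = fd_true * math.log(fd_true / fd_model) if fd_true > 0.0 else 0.0
    # Continuous part: integrate p_true * ln(p_true / p_model) from nu_d to nu_max.
    h = (nu_max - nu_d) / steps
    tail = 0.0
    for i in range(steps):
        nu = nu_d + (i + 0.5) * h
        pt = pdf_true(nu)
        if pt > 0.0:
            tail += pt * math.log(pt / pdf_model(nu)) * h
    return bin_term + tail


# Toy check with assumed densities: true Exp(1) versus model Exp(1.5).
rate = 1.5
nu_d = 0.5
d_toy = mixed_kl(
    lambda nu: math.exp(-nu),
    lambda nu: rate * math.exp(-rate * nu),
    1.0 - math.exp(-nu_d),          # f_d of the true density
    1.0 - math.exp(-rate * nu_d),   # f_d of the model density
    nu_d,
    nu_max=60.0,
)
```

Passing the bin fractions in directly, rather than integrating the densities down to zero, avoids trouble with densities that diverge as $\nu \to 0$, as the mass functions considered later do.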

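In the Laplace approximation for large $N$ (MacKay 
2003; Trotta 2008), we can Taylor expand the KL distance 
around its minimum, keeping the first two terms: 

$$ D(\theta, M) \approx D(\theta_0, M) + \frac{1}{2} \sum_{i,j} \delta\theta_i \, \delta\theta_j \, \partial_{\theta_i} \partial_{\theta_j} D(\theta_0, M), \qquad (6) $$

where $\delta\theta = \theta - \theta_0$, and $\theta_0$ represents the values of the parameters 
that minimize the KL distance with a given fiducial 
model. Assuming a flat prior, $\Pi(\theta | M)$ is given by the reciprocal 
of the volume of the parameter space, 
$\Pi(\theta | M) = (\Delta\theta_1 \Delta\theta_2 \cdots \Delta\theta_n)^{-1}$. For large $N$, the likelihood outside the 
boundaries of the parameter space is negligible, and we can 
estimate the razor integral as the integral over the entire 
Gaussian. Noticing that $\partial_{\theta_i} \partial_{\theta_j} D(\theta_0, M)$ equals the Fisher 
matrix $F_{ij}(\theta_0, M)$, we can write 

$$ R(M) \approx \Pi(\theta_0 | M)\, e^{-N D(\theta_0, M)} \sqrt{\frac{(2\pi)^n}{N^n \det F_{ij}(\theta_0, M)}}. \qquad (7) $$

The log razor (for a flat prior) is 

$$ \ln R(M) \approx -\ln(\Delta\theta_1 \cdots \Delta\theta_n) + \frac{n}{2} \ln 2\pi - \frac{1}{2} \ln \det\!\big(F_{ij}(\theta_0, M)\big) - \frac{n}{2} \ln N - N D(\theta_0, M). \qquad (8) $$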
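
As a rough check of the Laplace approximation, the sketch below compares a direct numerical evaluation of the razor integral with the closed-form log razor for a one-parameter toy model. The toy model (true density Exp(1), model density Exp($\theta$), for which $D(\theta) = \theta - 1 - \ln\theta$ and the second derivative at the minimum is 1), the function names, and the prior range are assumptions chosen for illustration only.

```python
import math


def kl_exponential(theta):
    """KL distance from a true Exp(1) density to a model Exp(theta) density."""
    return theta - 1.0 - math.log(theta)


def razor_numeric(n_data, lo, hi, steps=20000):
    """Razor integral with a flat prior on [lo, hi], using a midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * h
        total += math.exp(-n_data * kl_exponential(theta)) * h
    return total / (hi - lo)  # flat prior density is 1 / (hi - lo)


def log_razor_laplace(n_data, prior_volume, n_params, det_fisher, d_min):
    """Log razor in the Laplace approximation for a flat prior."""
    return (-math.log(prior_volume)
            + 0.5 * n_params * math.log(2.0 * math.pi)
            - 0.5 * math.log(det_fisher)
            - 0.5 * n_params * math.log(n_data)
            - n_data * d_min)


# Toy comparison: N = 1000 outcomes, flat prior on theta in [0.5, 2.0].
numeric = razor_numeric(1000, 0.5, 2.0)
laplace = log_razor_laplace(1000, prior_volume=1.5, n_params=1,
                            det_fisher=1.0, d_min=0.0)
```

For this toy case the two agree closely at $N = 1000$, and the residual difference shrinks as $N$ grows, which is the regime in which the approximation is meant to be used.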

To compare two models $M_1$ and $M_2$, we take the ratio 
of their razors. Suppose model $M_1$, with $m$ parameters, is 
nested inside model $M_2$, with $n > m$ parameters. Then the 
log razor ratio reduces to 

$$ \ln \frac{R(M_1)}{R(M_2)} \approx \ln(\Delta\theta_{m+1} \cdots \Delta\theta_n) - \frac{n - m}{2} \ln(2\pi) + \frac{1}{2} \ln \frac{|F_{ij}(\theta_{02}, M_2)|}{|F_{ij}(\theta_{01}, M_1)|} + \frac{n - m}{2} \ln N - N (D_{01} - D_{02}), \qquad (9) $$



where $D_{01} = D(\theta_{01}, M_1)$ and $D_{02} = D(\theta_{02}, M_2)$ are 
the minimum KL distances for each model. This expression 
is equivalent to the log of Eq. (10) in Heavens et al. (2007). 
Note that a positive log razor ratio favors model $M_1$ and 
a negative log razor ratio favors model $M_2$. The KL distance 
equals zero when the model distribution equals the 
true distribution. If the more complicated model $M_2$ is true, 
then $D_{01} > D_{02} = 0$, and the more complicated model 
will be favored for large $N$ with a log razor ratio decreasing 
linearly with $N$. If the simpler model $M_1$ is true, then 
$D_{01} = D_{02} = 0$ (since $M_1$ is nested inside $M_2$). In this 
case, for large $N$, the simpler model will be favored with a 
log razor ratio proportional to $\ln N$. 
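
The two regimes described above can be seen by evaluating the log razor ratio directly. The following is a minimal sketch of that expression for nested models; the function name, argument names, and the numerical inputs (unit prior range, unit Fisher determinants, $D_{01} = 10^{-3}$) are placeholders chosen for illustration, not values from this work.

```python
import math


def log_razor_ratio(n_data, extra_prior_volume, n_extra, det_f1, det_f2,
                    d01, d02=0.0):
    """ln[R(M1)/R(M2)] for M1 nested in M2 with n_extra additional parameters.

    extra_prior_volume is the product of the prior ranges of the extra
    parameters; det_f1 and det_f2 are the Fisher determinants of M1 and M2.
    """
    return (math.log(extra_prior_volume)
            - 0.5 * n_extra * math.log(2.0 * math.pi)
            + 0.5 * math.log(det_f2 / det_f1)
            + 0.5 * n_extra * math.log(n_data)
            - n_data * (d01 - d02))


# Simpler model true (D01 = D02 = 0): slow logarithmic growth with N.
simple = [log_razor_ratio(n, 1.0, 1, 1.0, 1.0, 0.0) for n in (1e4, 1e6)]

# More complicated model true (D01 > 0): the linear term dominates at large N.
complex_case = log_razor_ratio(1e6, 1.0, 1, 1.0, 1.0, 1e-3)
```

Increasing $N$ by a factor of 100 raises the first ratio by only $\frac{1}{2}\ln 100$ per extra parameter, whereas in the second case the ratio is already strongly negative.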

The razor, like Bayesian evidence, effectively measures 
how much of the prior volume in parameter space is taken 
up by the posterior distribution. For flat priors and Gaus- 
sian likelihoods, in the limit of large N, the integral of the 
posterior over the parameter space will be practically in- 
dependent of the boundaries of the parameter space. However, 
the parameter-space volume, $\Delta\theta_1 \cdots \Delta\theta_n$, depends sensitively 
on these bounds; the greater this volume, the smaller 
the razor. Thus the razor ratio will depend on the choice 
of parameter ranges or prior distribution considered. Any 
broad, smoothly varying prior may be approximated as a 
constant near the peak of the likelihood term (that is, the 
peak of $e^{-N D(\theta, M)}$) in the limit of large $N$, and thus may 
be estimated in the Laplace approximation by a flat prior 
with value $\Pi(\theta_0 | M)$ (or an effective parameter-space volume 
of $1 / \Pi(\theta_0 | M)$). 


The razor also depends on how narrowly peaked the 
posterior is, through $\det(F_{ij})$. A greater value of $\det(F_{ij})$ 
means greater curvature at the maximum of the posterior (a 
narrower peak) and thus a smaller volume under the posterior. 
So if two models have the same prior parameter-space 
volume and the same maximum likelihood, the model with 
the larger value of $\det(F_{ij})$ will have a smaller razor. This 
is slightly counter-intuitive because one might naively expect 
a "narrower" model to be "simpler" and thus favored 
by Occam's razor. However, the razor is concerned with the 
ratio of posterior volume to prior parameter-space volume, 
and the "narrower" model wastes more of the available parameter 
space. 

In the limit of large N, the log razor in Eq. (8) ap- 
proaches the expectation of the Bayesian Information Cri- 
terion (BIC) times negative one-half (cf. Trotta 2008, Eq. 
37).$^1$ 
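
The correspondence with the BIC can be made explicit with a short sketch, valid to leading order in $N$. Here $\mathcal{L}_{\max}$ is the maximum likelihood and $H(p_{\rm true})$ denotes the entropy of the true distribution; both symbols are our notation for this illustration. The entropy term is the same for every model and therefore cancels in razor ratios.

```latex
\begin{align}
\mathrm{BIC} &\equiv -2 \ln \mathcal{L}_{\max} + n \ln N, \\
\langle \ln \mathcal{L}_{\max} \rangle &\approx -N\, D(\theta_0, M) - N\, H(p_{\rm true}), \\
-\tfrac{1}{2} \langle \mathrm{BIC} \rangle &\approx -N\, D(\theta_0, M) - \tfrac{n}{2} \ln N - N\, H(p_{\rm true}).
\end{align}
```

The first two terms of the last line are the $N$-dependent terms of the log razor; the remaining terms of the log razor do not grow with $N$.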

4 Applying the razor to the cluster mass 
function 

The razor allows comparison of different functional forms 
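of the cluster mass function $f(\nu)$. To demonstrate, we compare 
the two-parameter Sheth-Tormen (ST) mass function 
(Sheth and Tormen 1999), 

$$ f(\nu) \propto \left(1 + (a\nu)^{-p}\right) \sqrt{\frac{a}{2\pi\nu}} \, e^{-a\nu/2}, \qquad (10) $$

with one-parameter mass functions keeping $a$ or $p$ constant: 

$$ f(\nu) \propto \sqrt{\frac{a}{2\pi\nu}} \, e^{-a\nu/2} \qquad (11) $$

(with $p = 0$) or 

$$ f(\nu) \propto \left(1 + \nu^{-p}\right) \sqrt{\frac{1}{2\pi\nu}} \, e^{-\nu/2} \qquad (12) $$

(with $a = 1$). Note that the Press-Schechter (PS) mass function 
(Press and Schechter 1974) corresponds to $a = 1$ and 
$p = 0$. 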
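
To use these forms as probability densities they must be normalized over $\nu > 0$. The sketch below assumes the ST form $f(\nu) \propto (1 + (a\nu)^{-p}) \sqrt{a / (2\pi\nu)}\, e^{-a\nu/2}$ and normalizes it analytically, which requires $p < 1/2$. The function names are illustrative, and the closed-form dust fraction is given only for the PS case, where it reduces to an error function.

```python
import math


def st_norm(p):
    """Normalization A(p) so that the ST form integrates to one (needs p < 1/2)."""
    return 1.0 / (1.0 + 2.0 ** (-p) * math.gamma(0.5 - p) / math.sqrt(math.pi))


def st_pdf(nu, a=1.0, p=0.0):
    """Normalized ST density in nu; a = 1 and p = 0 give Press-Schechter."""
    x = a * nu
    return (st_norm(p) * (1.0 + x ** (-p))
            * math.sqrt(a / (2.0 * math.pi * nu)) * math.exp(-0.5 * x))


def ps_dust_fraction(nu_d):
    """Fraction of outcomes below nu_d for the PS density (a = 1, p = 0)."""
    return math.erf(math.sqrt(0.5 * nu_d))


# Example: normalization for p = 0.3 and the PS dust fraction at nu_d = 1.
a_norm = st_norm(0.3)
f_d_ps = ps_dust_fraction(1.0)
```

Setting `a` or `p` to a fixed value in `st_pdf` gives the two one-parameter families, so the same function can serve as both the fiducial and the model density.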

We seek to estimate $N_{\min}$, the minimum number of 
particles (data) needed for strong evidence in favor of the 
"true" or fiducial model. On a Jeffreys scale (e.g. Trotta 
2008), strong evidence corresponds to 

$$ \left| \ln \frac{R(M_1)}{R(M_2)} \right| > 5. $$

From Eq. (9), $N_{\min}$ will clearly depend on the chosen 
prior volume for the extra parameters in $M_2$. If the simpler 
model $M_1$ is the true model, $N_{\min}$ will scale as 
$N_{\min} \propto (\Delta\theta_{m+1} \cdots \Delta\theta_n)^{-2/(n-m)}$. 
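
Given the log razor ratio of Eq. (9), the threshold condition can be solved for the minimum number of particles. The sketch below does this in closed form when the simpler model is true and by bisection when the more complicated model is true. All function names and the numerical inputs are illustrative assumptions; in practice the Fisher determinants and $D_{01}$ would come from the mass function models and the dust limit.

```python
import math


def _offset(extra_prior_volume, n_extra, det_f1, det_f2):
    """N-independent part of the log razor ratio."""
    return (math.log(extra_prior_volume)
            - 0.5 * n_extra * math.log(2.0 * math.pi)
            + 0.5 * math.log(det_f2 / det_f1))


def n_min_simple_true(extra_prior_volume, n_extra, det_f1, det_f2,
                      threshold=5.0):
    """N at which ln[R(M1)/R(M2)] reaches +threshold when D01 = D02 = 0."""
    c = _offset(extra_prior_volume, n_extra, det_f1, det_f2)
    return math.exp(2.0 * (threshold - c) / n_extra)


def n_min_complex_true(extra_prior_volume, n_extra, det_f1, det_f2, d01,
                       threshold=5.0):
    """N at which ln[R(M1)/R(M2)] falls to -threshold when D02 = 0 < D01."""
    c = _offset(extra_prior_volume, n_extra, det_f1, det_f2)

    def ratio(n):
        return c + 0.5 * n_extra * math.log(n) - n * d01

    lo = n_extra / (2.0 * d01)  # the ratio peaks here and decreases beyond it
    if ratio(lo) <= -threshold:
        return lo  # below the threshold for every N in this approximation
    hi = 2.0 * lo
    while ratio(hi) > -threshold:
        hi *= 2.0
    for _ in range(200):  # bisection on the decreasing branch
        mid = 0.5 * (lo + hi)
        if ratio(mid) > -threshold:
            lo = mid
        else:
            hi = mid
    return hi


# Placeholder inputs: one extra parameter, unit prior range, equal determinants.
base = n_min_simple_true(1.0, 1, 1.0, 1.0)
wider = n_min_simple_true(2.0, 1, 1.0, 1.0)
n_c = n_min_complex_true(1.0, 1, 1.0, 1.0, 1e-3)
```

With one extra parameter, doubling the prior range lowers the closed-form result by a factor of four, which is the power-law dependence on prior volume that follows from the $\ln N$ term.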

We calculate how $N_{\min}$ depends on the dust limit $\nu_d$, assuming 
parameter ranges of $\Delta a = 1$ and $\Delta p = 1$. Figure 1 
shows $N_{\min}$ versus $\nu_d$ needed to distinguish between the 
ST model and the one-parameter models: the left plot for a 
fiducial ST model with $a = 0.7$ and $p = 0.3$, and the right 
plot for a PS fiducial model. The blue solid lines correspond 
to the one-parameter model with $p = 0$ (Eq. (11)), and the 
red dashed lines, the one-parameter model with $a = 1$ (Eq. 
(12)). 



$^1$ Note that the popular Akaike Information Criterion (AIC) is based 
on a fundamentally different model-comparison approach, which estimates 
the expected error in the maximum log likelihood for each model (cf. Liddle 
2009; Takeuchi 2000). 
Fig. 1 $N_{\min}$ vs. $\nu_d$, where model $M_2$ is the 2-parameter ST model. For the blue solid line, model $M_1$ is the ST model with only $a$ 
as a free parameter ($p$ is fixed at $p = 0$). For the red dashed line, model $M_1$ is the ST model with only $p$ as a free parameter ($a$ is 
fixed at $a = 1$). Left: The fiducial model is ST with $a = 0.7$ and $p = 0.3$ (requiring both parameters). Right: The fiducial model is the 
Press-Schechter model ($a = 1$, $p = 0$). 

Fig. 2 $m_{\rm box} / m_d$ vs. $m_d$ in solar masses, where model $M_2$ is the 2-parameter ST model. For the blue solid line, model $M_1$ is the ST 
model with only $a$ as a free parameter ($p$ is fixed at $p = 0$). For the red dashed line, model $M_1$ is the ST model with only $p$ as a free 
parameter ($a$ is fixed at $a = 1$). Left: The fiducial model is ST with $a = 0.7$ and $p = 0.3$ (requiring both parameters). Right: The fiducial 
model is the Press-Schechter model ($a = 1$, $p = 0$). 



For the case where the two-parameter ST model is the 
fiducial model (Fig. 1 left), the razor ratio will be negative 
and dominated by the KL distance between the simpler 
model and the fiducial model ($-N D_{01}$) for large $N$. 
Equation (11) can more closely approximate the fiducial ST 
model at large $\nu$, but Eq. (12) can more closely approximate 
the fiducial model at small $\nu$. We can thus see why, for small 
$\nu_d$, it takes more particles to distinguish the fiducial model 
from $M_1$ given by Eq. (12) than Eq. (11): the value of $D_{01}$ 
is smaller for model Eq. (12). For large values of $\nu_d$, the situation 
is reversed, and model Eq. (11) is harder to rule out 
than model Eq. (12). 

For the case where the PS model is true, the minimum 
KL distance between the fiducial model and all the 
parametric models is zero, and the razor ratio is positive 
and dominated by the $\ln N$ term for large $N$. The ability 
to distinguish the simpler models depends on the value 
of $|F_{ij}(\theta_{01}, M_1)|$, that is, the curvature of the log likelihood 
near its maximum; the greater the curvature, the 
lower the evidence of the simpler model compared to the 
ST model, and the more particles are needed to prefer the 
simpler model. Model Eq. (11) ($p = 0$) deviates from PS 
more quickly at high $\nu$, and thus has a higher value of 
$|F_{ij}(\theta_{01}, M_1)|$ than model Eq. (12) ($a = 1$) for high values 
of $\nu_d$. 

We can relate the dust limit $\nu_d$ to a mass limit $m_d$ by 
assuming a standard ΛCDM cosmology, approximating the 
power spectrum using the fitting function given by Eq. (7) 
in Efstathiou, Bond, and White (1992) (see also Bond and 
Efstathiou 1984) with $h = 0.71$ and $\Omega_m = 0.27$, and taking 
$\delta_c = 1.69$. We assume that a minimum of 100 particles form 
a cluster, and set the mass of a particle to be $\delta m = m_d / 100$. 
Then we convert the minimum number of particles $N_{\min}$ to 
the mass of these particles, $m_{\rm box}$, via $m_{\rm box} = N_{\min} m_d / 100$. For 
$m_{\rm box} \gg m_d$, $m_{\rm box}$ gives the minimum size of a simulation 
box needed to distinguish among models, given a dust limit 
and a fiducial underlying model.$^2$ These results, effectively 

$^2$ Our analysis assumes the $N$ particles are drawn from clusters of all 
possible masses, using the likelihood function from Manera et al. (2010) 
Appendix A. When $m_{\rm box} / m_d \sim$ a few, however, there are clearly just a few 
clusters in the simulation box, and in that case, $m_{\rm box}$ provides a lower 
limit on the simulation size actually needed to distinguish models. 
just a transformation of the graphs in Fig. 1, are shown in 
Fig. 2. Clearly, the size of the simulation needed to distinguish 
models increases with the minimum cluster mass $m_d$. 
We see that a larger (smaller) number of particles in Fig. 
1 corresponds to a larger (smaller) ratio of the simulation 
mass to the minimum cluster mass required to distinguish 
models. 
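
The conversion from particle number to simulation mass is simple arithmetic, sketched below under the stated assumption of 100 particles per minimum-mass cluster. The function name and the example inputs ($N_{\min} = 10^6$, $m_d = 10^{13}$ solar masses) are hypothetical and are not read off the figures.

```python
def box_mass(n_min, m_d, particles_per_cluster=100):
    """Total mass of n_min particles, each of mass m_d / particles_per_cluster."""
    particle_mass = m_d / particles_per_cluster
    return n_min * particle_mass


# Hypothetical inputs: 1e6 particles and a 1e13 solar-mass cluster limit.
m_box = box_mass(1e6, 1e13)
ratio_to_limit = m_box / 1e13  # simulation mass in units of m_d
```

Because the particle mass scales with $m_d$, the ratio $m_{\rm box} / m_d$ depends only on the particle count: it equals $N_{\min} / 100$.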

5 Discussion 

We have demonstrated a new application of the Bayesian razor 
to estimate the necessary N-body simulation size to distinguish 
among different models of the cluster mass function, 
with different numbers of free parameters. Our approach 
quantifies how this simulation volume depends on 
the minimum cluster mass $m_d$. Reducing the dust limit significantly 
enhances the ability to distinguish models, as the 
mass of the simulation in units of $m_d$ increases by roughly 
an order of magnitude as $m_d$ increases from $10^{12} M_\odot$ to 
$10^{14} M_\odot$. In general, it is much more difficult to have strong 
evidence against a complicated model than to strongly favor 
it. This is because the log razor ratio goes as $\ln N$ when the 
simpler model is true, but goes as $N$ when the more complicated 
model is true. In our examples, simulations must be 
thousands of times larger to favor the true model for fiducial 
one-parameter models than for the fiducial two-parameter 
model. Future work will extend our analysis to compare 
higher-dimensional mass function parameterizations and to 
incorporate redshift-evolution effects, sample variance, and 
measurement errors for cluster surveys. We note that the application 
of the razor to a mixed probability distribution may 
also be used in other applications of Bayesian model comparison 
where certain ranges of data are binned. 